Abstract - Breadth First Search (BFS) is widely used for measuring
large unknown graphs, such as Online Social Networks. It
has been empirically observed that an incomplete BFS is biased 
toward high degree nodes. In contrast to more studied sampling 
techniques, such as random walks, the precise bias of BFS has 
not been characterized to date. 

In this paper, we quantify the degree bias of BFS sampling. In 
particular, we calculate the node degree distribution expected to 
be observed by BFS as a function of the fraction of covered nodes, 
in a random graph $RG(p_k)$ with a given degree distribution $p_k$.
Furthermore, we also show that, for $RG(p_k)$, all commonly used
graph traversal techniques (BFS, DFS, Forest Fire, and Snowball 
Sampling) lead to the same bias, and we show how to correct 
for this bias. To give a broader perspective, we compare this 
class of exploration techniques to random walks that are well- 
studied and easier to analyze. Next, we study by simulation the 
effect of graph properties not captured directly by our model. 
We find that the bias gets amplified in graphs with strong 
positive assortativity. Finally, we demonstrate the above results 
by sampling the Facebook social network, and we provide some
guidelines for graph sampling in practice.

Index Terms - BFS, Breadth First Search, graph sampling,
degree bias, Online Social Networks (OSN).

I. Introduction 

A large body of work in the networking community focuses 
on topology measurements at various levels, including the 
Internet, the Web (WWW), peer-to-peer (P2P) and online 
social networks (OSN). The size of these networks and other 
practical restrictions make measuring the entire graph impos- 
sible. Instead, researchers typically collect and study a small 
but "representative" sample. In this paper, we are particularly 
interested in sampling networks that naturally allow us to explore
the neighbors of a given node (which is the case in WWW, P2P 
and OSN). A number of graph exploration techniques use this 
basic operation for sampling. They can be roughly classified 
in two categories: (a) with replacement (random walks), and 
(b) without replacement (graph traversal techniques). 

In the first category, random walks, nodes can be revisited. 
This category includes the classic Random Walk (RW) as well 
as the Metropolis-Hastings Random Walk (MHRW). They are 
used for sampling of nodes on the Web [1], P2P networks
[2]-[4], OSNs [5], [6] and large graphs in general [7]. Random walks
are well studied [8] and result in samples that have either no
bias (MHRW) or a known bias (RW) that can be corrected
for. Random walks are not the focus of this paper, but are
discussed as a baseline for comparison.

In the second category, graph traversal techniques, each
node is visited exactly once (if we let the process run until
completion). These methods vary in the order in which they
visit the nodes; examples include BFS, Depth-First Search
(DFS), Forest Fire (FF) and Snowball Sampling (SBS). Graph
traversals, especially BFS, are very popular and widely used
for sampling large networks, e.g., WWW [9] or OSNs [10]-[12].
One reason is that BFS is well-known (a textbook
technique) and easy to understand. Another reason is that
(incomplete) BFS collects a full view (all nodes and edges)
of some particular region in the graph, which is sometimes
believed to be representative of the entire graph. E.g., a BFS
sample of a lattice is a (smaller) lattice.

Fig. 1. Overview of results. We calculate the average node degree $\langle k^* \rangle$
(and the full degree distribution, not shown) expected to be observed by BFS
in a random graph $RG(p_k)$ with a given degree distribution $p_k$, as a function
of the fraction of sampled nodes $f$. We show RW and MHRW as a reference.
$\langle k \rangle$ is the real average node degree, and $\langle k^2 \rangle$ is the real average squared node
degree. Observations: (1) For a small sample size, BFS has the same
bias as RW; with increasing $f$, the bias decreases; a complete BFS ($f = 1$)
is unbiased, as is MHRW (or uniform sampling). (2) All common graph
traversal techniques (that do not revisit the same node) lead to the same
bias. (3) The shape of the BFS curve depends on the real node degree
distribution $p_k$, but it is always monotonically decreasing.

Unfortunately, this intuition often fails. It was observed 
empirically that BFS introduces a bias towards high-degree 
nodes [9], [13], [14]. We also confirmed this fact in a recent
measurement of Facebook [5], where our BFS crawler found
an average node degree of approximately 324, while the real value
is only $\langle k \rangle \approx 94$, i.e., about 3.5 times smaller!

Given the popularity of BFS on one hand, and its bias on 
the other hand, it is surprising that we still know relatively 
little about the statistical properties of node sequences returned 
by BFS. Because sampling without replacement introduces
complex dependencies, no rigorous analytical explanation of
the observed biases of BFS has been available to date.

Our work is a first step toward understanding the statistical 
characteristics of incomplete BFS sampling. In particular, we 
calculate precisely the node degree distribution expected to 
be observed by BFS as a function of the fraction of sampled
nodes in a random graph $RG(p_k)$ with a given (and arbitrary)
degree distribution $p_k$. We accompany this central result
with additional related contributions. First, we show that in
$RG(p_k)$, BFS is equivalent to other graph traversal techniques,
such as Depth First Search (DFS), Snowball Sampling, and 
Forest Fire (FF). Second, we compare the bias of BFS (and 
other traversal techniques) to that of random walks. As shown 
in Fig. 1 and as also formally demonstrated in this paper, in the
beginning of the exploration process, BFS exhibits exactly the
same bias as the Random Walk (RW). With increasing fraction
of sampled nodes $f$, this bias monotonically decreases. When
the BFS is complete ($f = 1$), there is no bias, as can
also be achieved by the Metropolis-Hastings Random Walk
(MHRW). Moreover, given a biased sample, we derive an
unbiased estimator of the original node degree distribution. 

In addition, we use simulation to confirm our analysis and 
investigate the effect of graph properties, such as assortativity, 
not captured directly by $RG(p_k)$. We complement it with
real-world measurements of the Facebook social network.

Scope. Our theoretical results hold for the random graph
model $RG(p_k)$ described in Section IV. We study some
extensions of this model in simulations in Section VII. We
also restrict our attention to BFS sampling of static graphs.

The outline of the paper is as follows. Section II discusses
related work. Section III presents the graph sampling
algorithms under study. Section IV presents the random graph
model used in this paper. Section V analyzes the expected
degree distribution of various graph sampling techniques; in
particular the main results related to BFS are derived in
Section V-B. Section VI shows how to correct for the bias.
Section VII presents simulation results. Section VIII
demonstrates the above ideas by sampling a real world network,
Facebook, and provides hints for graph sampling in practice.
Section IX concludes and outlines future work.

II. Related Work 

BFS used in practice. BFS is widely used today for
exploring large networks, such as OSNs. The following list provides
some examples but is by no means exhaustive. In [10], Ahn et
al. used BFS to sample Orkut and MySpace. In [11] and [15],
Mislove et al. used BFS to crawl the social graph in four
popular OSNs: Flickr, LiveJournal, Orkut, and YouTube. In [12],
Wilson et al. measured the social graph and the user interaction
graph of Facebook using several BFSs, each BFS constrained
to one of the largest 22 regional Facebook networks. In our
recent work [5], we have also crawled Facebook using various
sampling techniques, including BFS, RW and MHRW. It has
been empirically observed that incomplete BFS and its variants
introduce bias towards high-degree nodes [9], [13], [14]. We also
confirmed this in Facebook [5], an observation that in fact
inspired this paper.

Analyzing BFS. To the best of our knowledge, the sampling
bias of BFS has not been analyzed so far. [16] and [17] are
the closest related papers to our methodology. The original
paper by Kim [16] analyzes the size of the largest connected
component in a classic Erdős-Rényi random graph by essentially
applying the configuration model with node degrees chosen
from a Poisson distribution. To match the stubs (or 'clones'
in [16]) uniformly at random in a tractable way, Kim proposes
a "cut-off line" algorithm: he first assigns each stub a random
index from [0, np], and next progressively scans this interval.
Achlioptas et al. used this powerful idea in [17] to study the
bias of traceroute sampling in random graphs with a given
degree distribution. The basic operation in [17] is traceroute
(i.e., "discover a path") and is performed from a single node
to all other nodes in the graph. The union of the observed
paths forms a "BFS-tree", which includes all nodes but misses
some edges (e.g., those between nodes at the same depth in the
tree). In contrast, the basic operation in the traversal methods
presented in our paper is to discover all neighbors of a node,
and it is applied to all nodes in increasing distance from
the origin. Another important difference is that [17] studies a
completed BFS-tree, whereas we study the sampling process
when it has visited only a fraction $f < 1$ of nodes; a completed
BFS ($f = 1$) is trivial in our case (it has no bias).

There is also a large body of literature on unequal
probability sampling without replacement [18]. Although, at first,
it seems to be a promising path to follow, to the best of our
knowledge, none of the existing results is directly applicable
to our problem. This is because, speaking in the terms used
later in this paper, the available results either (i) require
the knowledge of $q_k(f)$ as an input, or (ii) propose how to
calculate $q_k(f)$ for the first two nodes only.

Another recent paper related to BFS bias is [19]. The
paper is about Snowball Sampling [20], which is similar to
BFS, and proposes a heuristic approach to correct the degree
biases in the $i$th generation of Snowball based on the values
measured in generation $i-1$. The authors show by simulation
that this technique performs moderately well, especially when
a significant fraction of nodes have been covered.

Random Walks. Simple and metropolized random walks are
also used for crawling OSNs [5], [6], P2P networks [2]-[4], the
web [1] and large graphs in general [7]. Random walks are
well-studied [8]; their bias is known and can be corrected.
Random walks are not the focus of the paper but are used as
a baseline for comparison.

III. Graph exploration techniques 

Let $G = (V, E)$ be a connected graph with the set of vertices
$V$, and a set of undirected edges $E$. Initially, $G$ is unknown,
except for one (or some limited number of) seed node(s). 
When sampling through graph exploration, we begin at the 
seed node, and we recursively visit (one, some or all) of its 
neighbors. We distinguish two main categories of exploration 
techniques: with and without replacement. 

A. Exploration with replacement (random walks) 

Exploration with replacement, or simply a walk, allows 
revisiting the same node many times. Consider the following 
classic examples: 
1) Random Walk (RW): In this classic sampling
technique [8], we start at some seed node. At every iteration, the
next-hop node $v$ is chosen uniformly at random among the
neighbors of the current node $u$. It is easy to see that RW
introduces a linear bias towards nodes of high degree [8].

2) Metropolis Hastings Random Walk (MHRW): In this
technique, as in RW, the next-hop node $w$ is chosen uniformly
at random among the neighbors of the current node $u$.
However, with a probability that depends on the degrees of $w$ and $u$,
MHRW performs a self-loop instead of moving to $w$. More
specifically, the probability $P_{u,w}$ of moving from $u$ to $w$ is as
follows [21]:

$$
P_{u,w} =
\begin{cases}
\frac{1}{k_u} \cdot \min\left(1, \frac{k_u}{k_w}\right) & \text{if } w \text{ is a neighbor of } u, \\
1 - \sum_{y \neq u} P_{u,y} & \text{if } w = u, \\
0 & \text{otherwise,}
\end{cases}
\tag{1}
$$

where $k_v$ is the degree of node $v$. Essentially, MHRW reduces
the transitions to high degree nodes and thus eliminates the
degree bias of RW. This property of MHRW was recently
exploited in various network sampling contexts [2], [3], [5], [6].

3) Respondent-Driven Sampling (RDS): RDS was proposed
and studied in the field of social sciences to penetrate hidden
populations, such as that of drug addicts [22], [23]. In the
network sampling terminology, at each iteration RDS selects
randomly exactly $n$ neighbors (typically $n \approx 3$) of the current
node $u$ and schedules them to visit later. RDS visits the
nodes in the order they were scheduled. Thus, RDS is a
modification of Snowball Sampling (described below) that
allows node revisiting.^1 RDS introduces a degree bias that is
known and can be corrected for. It was demonstrated in [23]
on the example with $n = 1$, which reduces RDS precisely to
Random Walk (RW). This approach was recently tested in [4]
on various graph models and unstructured P2P networks.
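As an illustration of the MHRW transition rule in Eq. (1), the following Python sketch builds the transition probabilities for a small graph. The function name `mhrw_transition_matrix`, the adjacency-dictionary input format and the toy graph are assumptions made for this example only. Because the resulting matrix is symmetric, its column sums equal its row sums, which is one way to see why the uniform distribution is stationary.

```python
from fractions import Fraction


def mhrw_transition_matrix(adj):
    """Return P[u][w] following Eq. (1) for a simple undirected graph.

    adj maps each node to the set of its neighbors (no self-loops assumed).
    """
    P = {}
    for u, nbrs in adj.items():
        k_u = len(nbrs)
        row = {}
        for w in nbrs:
            k_w = len(adj[w])
            # Propose w with probability 1/k_u, accept with min(1, k_u/k_w).
            row[w] = Fraction(1, k_u) * min(Fraction(1), Fraction(k_u, k_w))
        # The remaining probability mass is the self-loop at u.
        row[u] = 1 - sum(row.values())
        P[u] = row
    return P


# Hypothetical toy graph: a star centered at node 0, plus the edge 1-2.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
P = mhrw_transition_matrix(adj)
```

Exact rational arithmetic is used here only so that the row and column sums can be compared without rounding error.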

B. Exploration without replacement (graph traversals) 

In contrast, exploration without replacement, or graph 
traversal, never revisits the same node. At the end of the
process, and assuming that the graph is connected, all nodes 
are visited. 

1) Breadth First Search (BFS): BFS is a classic graph
traversal algorithm that starts from the seed and progressively 
explores all neighbors. At each new iteration the earliest 
explored but not-yet-visited node is selected next. Thus, BFS 
discovers all nodes within some distance from the seed. 

2) Depth First Search (DFS): This technique is similar to 
BFS, except that at each iteration we select the latest explored 
but not-yet-visited node. As a result, DFS explores first the 
nodes that are far away (in the number of hops) from the seed.

^1 In practical RDS surveys in human populations, nodes (people) are not
revisited. However, the revisiting assumption is necessary to formally correct
for the degree bias [23]. The authors of [23] argue that this approximation is
valid if the sample size is relatively small compared to the population size.
In this paper we formally confirm this claim.
Notation Summary. 'Observed' means calculated directly
from the sample. 



3) Forest Fire (FF): FF is a randomized version of BFS, 
where for every neighbor v of the current node, we flip a coin, 
with probability of success p, to decide if we explore v. FF 
reduces to BFS for p= 1. It is possible that this process dies 
out before it covers all nodes. In this case, in order to make FF 
comparable with other techniques, we revive the process from 
a random node already in the sample. Forest Fire is inspired by
the graph growing model of the same name proposed in [24]
and is used as a graph sampling technique in [7].

4) Snowball Sampling (SBS): Snowball Sampling is a
precursor of RDS and a term loosely used for BFS-like traversal
techniques. According to a classic definition by Goodman [20],
an $n$-name Snowball Sampling is similar to BFS, but at
every node $v$, not all $k_v$, but exactly $n$ neighbors are chosen
randomly out of all $k_v$ neighbors of $v$. These $n$ neighbors
are scheduled to visit, but only if they have not been visited
before.

IV. Graph model $RG(p_k)$

A basic important graph property is the node degree
distribution $p_k$, i.e., the fraction of nodes with degree equal
to $k$, for all $k \geq 0$.^2 Depending on the network, the degree
distribution can vary, ranging from constant-degree (in regular
graphs), a distribution concentrated around the average value
(e.g., in Erdős-Rényi random graphs or in well-balanced
P2P networks), to heavily right-skewed distributions with $k$
covering several decades (in WWW, unstructured P2P, Internet
at the Autonomous System level, OSNs). We handle all these
cases by assuming that we are given any fixed node degree
distribution $p_k$. Other than that, the graph $G$ is completely
random. That is, $G$ is drawn uniformly at random from the
set of all multigraphs^3 with degree distribution $p_k$. We denote
this model by $RG(p_k)$.

We use a classic technique to generate $RG(p_k)$, called the
configuration model [25], [26]: each node $v$ is given $k_v$ "stubs"
(or "edges-to-be"). Next, all these $\sum_{v \in V} k_v = 2|E|$ stubs are
randomly matched in pairs, until all stubs are exhausted (and
$|E|$ edges are created). In Fig. 2 (ignore the rectangular interval
[0, 1] for now), we present four nodes with their stubs (left)
and an example of their random matching (right).
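The stub-matching construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the generator used for the results in this paper; the function name `configuration_model`, the list-of-degrees input and the example degree values are assumptions. Shuffling the stubs and pairing consecutive entries is one simple way to draw a uniformly random matching.

```python
import random


def configuration_model(degrees, seed=None):
    """Match stubs uniformly at random and return a list of edges (u, v).

    degrees[v] is the target degree k_v; the sum of degrees must be even.
    The result is a multigraph: self-loops and repeated edges may occur.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("sum of degrees must be even")
    rng = random.Random(seed)
    # One list entry per stub, labeled by the node that owns it.
    stubs = [v for v, k in enumerate(degrees) for _ in range(k)]
    # A uniformly random permutation, paired off consecutively, gives a
    # uniformly random perfect matching of the stubs.
    rng.shuffle(stubs)
    return [(stubs[i], stubs[i + 1]) for i in range(0, len(stubs), 2)]


# Four nodes with illustrative (assumed) target degrees.
degrees = [3, 2, 2, 1]
edges = configuration_model(degrees, seed=1)
```

Every node ends up with exactly its target degree, provided a self-loop is counted twice for its endpoint.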

^2 As we define $p_k$ as a 'fraction', not the 'probability', $p_k$ determines the
degree sequence in the graph, and vice versa.

^3 A multigraph is a graph that accepts multiple edges and self-loops.
V. Analyzing the Node Degree Bias

In this section, we study the node degree bias observed
when the graph exploration techniques of Section III are run
on the random graph $RG(p_k)$ of Section IV. In particular, we
derive the node degree distribution $q_k$ and the average node
degree $\langle k^* \rangle$ expected to be observed, as a function of the
original degree distribution $p_k$ and, in the case of BFS, of the
fraction of sampled nodes $f$.

A. Exploration with replacement (walks) 

We begin by summarizing the relevant results known for 
walks, in particular for RW and MHRW. They will serve as 
a reference point for our main analysis of graph traversals in 
the next section. 

1) Random Walk (RW): Random walks have been widely
studied; see [8] for an excellent survey. In any given
connected and aperiodic graph, the probability of being at a
particular node $v$ converges at equilibrium to the stationary
distribution $\pi_v = \frac{k_v}{2|E|}$. Therefore, the expected observed degree
distribution $q_k$ is

$$
q_k = \sum_{v \in V} \pi_v \cdot 1_{\{k_v = k\}} = \frac{k}{2|E|} \, p_k |V| = \frac{k \, p_k}{\langle k \rangle}, \tag{2}
$$

where $\langle k \rangle$ is the average node degree in $G$. Eq. (2) is
essentially similar to the calculation for RDS in [23], [27]. As this holds
for any fixed (and connected and aperiodic) graph, it is also
true for all connected graphs generated by the configuration
model. Consequently, the expected observed average node
degree is

$$
\langle k^* \rangle = \sum_k k \, q_k = \frac{\sum_k k^2 p_k}{\langle k \rangle} = \frac{\langle k^2 \rangle}{\langle k \rangle}, \tag{3}
$$

where $\langle k^2 \rangle$ is the average squared node degree in $G$. We show
this value in Fig. 1.

2) Metropolis Hastings Random Walk (MHRW): It is easy
to show that the transition matrix $P_{u,w}$ shown in Eq. (1)
leads to a uniform stationary distribution $\pi_v = \frac{1}{|V|}$ [21], and
consequently:

$$
q_k = p_k, \tag{4}
$$

$$
\langle k^* \rangle = \sum_k k \, p_k = \langle k \rangle. \tag{5}
$$

In Fig. 1 we show that MHRW estimates the true mean.
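To make the two reference cases concrete, the short Python sketch below evaluates Eqs. (2) and (3) for a toy degree distribution. The function name `rw_expected` and the example distribution are assumptions chosen for illustration. For MHRW no computation is needed, since Eqs. (4) and (5) state that the expected observed distribution equals $p_k$.

```python
def rw_expected(p):
    """Eqs. (2)-(3): p maps degree k to the fraction p_k.

    Returns the expected observed distribution q under a random walk
    and the expected observed average degree.
    """
    mean_k = sum(k * pk for k, pk in p.items())
    q = {k: k * pk / mean_k for k, pk in p.items()}
    mean_obs = sum(k * qk for k, qk in q.items())
    return q, mean_obs


# Toy distribution (assumed): half the nodes have degree 1, half degree 3.
p = {1: 0.5, 3: 0.5}
q_rw, k_rw = rw_expected(p)
```

Here $\langle k \rangle = 2$ and $\langle k^2 \rangle = 5$, so the random walk is expected to observe an average degree of $5/2$, while MHRW is expected to observe 2.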

B. Exploration without replacement (Main Result) 

In both RW and MHRW the nodes can be revisited. So
the state of the system at iteration $i + 1$ depends only on
iteration $i$, which makes it possible to analyze them as Markov
chains. In contrast, graph traversals do not allow for node
revisits, which introduces crucial dependencies between all
the iterations and significantly complicates the analysis. To
handle these dependencies, we adopt an elegant technique
recently introduced in [16] (to study the size of the largest
connected component) and extended in [17] (to study the bias
of traceroute sampling). However, our work differs in many
aspects from both [16] and [17], which we comment on in detail
in the related work Section II.

1) Exploration without replacement at the stub level: We
begin by defining Algorithm 1 (below), a general graph
traversal technique that collects a sequence of nodes $S$, without
replacements. To be compatible with the configuration model
(see Section IV), we are interested in the process at the stub
level, where we consider one stub at a time, rather than one
node at a time. An integral part of the algorithm is a queue $Q$,
that keeps the discovered, but still not-yet-followed stubs. We
start the algorithm by adding to $Q$ all the stubs of some initial
node $v_1$, and by setting $S = [v_1]$. Next, at every iteration, we
pop one stub $a$ from $Q$, and follow it to discover its partner
stub $b$, and $b$'s owner $v(b)$. If node $v(b)$ is not yet discovered,
i.e., if $v(b) \notin S$, then we append $v(b)$ to $S$ and we add to $Q$
all other stubs of $v(b)$. More formally:

Algorithm 1 Stub-Level Graph Traversal

$S \leftarrow [v_1]$ and $Q \leftarrow$ [all stubs of $v_1$]
while $Q$ is nonempty do
    Pop $a$ from $Q$
    Discover $a$'s partner $b$
    if $v(b) \notin S$ then
        Append $v(b)$ to $S$
        Add to $Q$ all stubs of $v(b)$ except $b$
    else
        Remove $b$ from $Q$
    end if
end while



Depending on the scheduling discipline for the elements
in $Q$ (line 3), Algorithm 1 implements BFS (for a first-in
first-out scheduling), DFS (last-in first-out) or Forest Fire (first-in
first-out with randomized stub losses). Line 9 guarantees
that the algorithm never traces back the edges, i.e., that stub $a$
popped from $Q$ in line 3 never belongs to an edge that has
already been traversed in the opposite direction.

2) Discovery on the fly: In line 4 of Algorithm 1, we follow
stub $a$ to discover its partner $b$. In a fixed graph $G$, this
step is deterministic. In the configuration model $RG(p_k)$, a
fixed graph $G$ is obtained by matching all the stubs uniformly
at random. Next we can sample this fixed graph and average the result
over the space of all the random graphs $RG(p_k)$ that have just
been constructed. Unfortunately, this quickly leads to complex
combinatorial problems. We therefore adopt an alternative and
more tractable construction of a fixed graph, which is an iterative
sampling from the set of random graphs, by selecting $b$ 'on the
fly' (i.e., every time line 4 is executed), uniformly at random
from all the unmatched stubs. By the principle of deferred
decisions [28], these two approaches are equivalent.

3) Breaking the dependencies: There is still one problem 
with the 'on the fly' method. It selects stub b uniformly 
at random from all the unmatched stubs. This introduces 
dependencies between the stubs and across all the iterations. 



[Figure 2. Horizontal axis of the panels: time $t$ (index); the center panel marks the current time $t$.]

Fig. 2. An illustration of the stub-level, on-the-fly graph exploration without replacements. In this particular example, we show an execution of BFS starting
at node $v_1$. Left: Initially, each node $v$ has $k_v$ stubs, where $k_v$ is a given target degree of $v$. Each of these stubs is assigned a real-valued number drawn
uniformly at random from the interval [0, 1] shown below the graph. Next, we follow Algorithm 1 with a starting node $v_1$. The numbers next to the stubs of
every node $v$ indicate the order in which these stubs are added to the queue $Q$. Center: The state of the system at time $t$. All stubs in $[0, t]$ have already
been matched (the indices of matched stubs are set in plain line). All unmatched stubs are distributed uniformly at random on $(t, 1]$. This interval can contain
also some (here two) already matched stubs. Right: The final result is a realization of a random graph $G$ with a given node degree sequence (i.e., of the
configuration model). $G$ may contain self-loops and multiedges.



We remedy this by implementing the 'on the fly' approach
as follows. First, we assign each stub a real-valued index $t$
drawn uniformly at random from the interval [0, 1]. Then,
every time we process line 4, we pick $b$ as the unmatched stub
with the smallest index. We can interpret this as a continuous-time
process, where we determine progressively the partners
of stubs popped from queue $Q$, by scanning the interval from
'time' $t = 0$ to $t = 1$ in search of unmatched stubs. Because
the indices chosen by the stubs are independent from each
other, the above trick breaks the dependence between the stubs,
which is crucial for making this approach tractable.

In Fig. 2 we present an example execution of Algorithm 1,
where line 4 is implemented as described above.

4) Expected sampled degree distribution $q_k$: Now we are
ready to derive the expected observed degree distribution $q_k$.
Recall that all the stub indices are chosen independently and
uniformly from [0, 1]. A vertex $v$ with degree $k$ is not sampled
yet at time $t$ if the indices of all its $k$ stubs are larger than $t$,
which happens with probability $(1-t)^k$. So the probability
that $v$ is sampled before time $t$ is $1-(1-t)^k$. Therefore, the
expected fraction of vertices of degree $k$ sampled before $t$ is

$$
f_k(t) = p_k \left(1-(1-t)^k\right). \tag{6}
$$

By normalizing (6), we obtain the expected observed (sampled)
degree distribution at time $t$:

$$
q_k(t) = \frac{f_k(t)}{\sum_l f_l(t)} = \frac{p_k \left(1-(1-t)^k\right)}{\sum_l p_l \left(1-(1-t)^l\right)}. \tag{7}
$$

Unfortunately, it is difficult to interpret $q_k(t)$ directly,
because $t$ is proportional neither to the number of matched edges
nor to the number of discovered nodes. Recall that our primary
goal is to express $q_k$ as a function of the fraction $f$ of covered
nodes. We achieve this by calculating $f(t)$, the expected
fraction of nodes, of any degree, visited before time $t$:

$$
f(t) = \sum_k f_k(t) = 1 - \sum_k p_k (1-t)^k. \tag{8}
$$

Because $p_k \geq 0$, and $p_k > 0$ for at least one $k > 0$, the term
$\sum_k p_k (1-t)^k$ is continuous and strictly decreasing from 1 to
0 with $t$ growing from 0 to 1. Thus, for $f \in [0, 1]$ there exists a
well defined function $t(f)$ that satisfies Eq. (8), i.e., the inverse
of $f(t)$. Although we cannot compute $t(f)$ analytically (except
in some special cases such as for $k \leq 4$), it is straightforward
to find it numerically. Now, we can rewrite Eq. (7) as

$$
q_k(f) = \frac{p_k \left(1-(1-t(f))^k\right)}{\sum_l p_l \left(1-(1-t(f))^l\right)}, \tag{9}
$$

which is the expected observed degree distribution after
covering fraction $f$ of nodes of graph $G$.
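The numerical inversion of Eq. (8) mentioned above can be sketched as follows. This is one possible approach, using plain bisection; the function names `coverage`, `time_for_coverage` and `bfs_expected`, the tolerance, and the toy distribution are assumptions for this example. The sketch also assumes there are no degree-0 nodes, so that $f(1) = 1$.

```python
def coverage(p, t):
    """Eq. (8): expected fraction of nodes visited before time t."""
    return 1.0 - sum(pk * (1.0 - t) ** k for k, pk in p.items())


def time_for_coverage(p, f, tol=1e-12):
    """Invert Eq. (8) by bisection, using that f(t) is increasing in t."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if coverage(p, mid) < f:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def bfs_expected(p, f):
    """Eq. (9): expected observed distribution q_k(f) and its mean."""
    t = time_for_coverage(p, f)
    w = {k: pk * (1.0 - (1.0 - t) ** k) for k, pk in p.items()}
    total = sum(w.values())
    q = {k: wk / total for k, wk in w.items()}
    return q, sum(k * qk for k, qk in q.items())


# Toy distribution (assumed): half the nodes have degree 1, half degree 3.
p = {1: 0.5, 3: 0.5}
q_half, mean_half = bfs_expected(p, 0.5)
```

For this toy distribution the expected observed mean lies between $\langle k \rangle = 2$ (reached at $f = 1$) and $\langle k^2 \rangle / \langle k \rangle = 2.5$ (approached as $f \to 0$).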

5) Equivalence of traversal techniques under $RG(p_k)$:
An interesting observation is that, under the random graph
model $RG(p_k)$, all common traversal techniques (BFS, DFS,
FF, SBS, ...) are subject to exactly the same bias. This is
because the sampled node sequence $S$ is fully determined by
the choice of stub indices on [0, 1], independently of the way
we manage the elements in $Q$.

This observation applies to the sequence S only - the 
subgraphs of G that we actually sample by BFS and DFS, 
for example, might significantly differ. 
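The stub-level traversal with indexed stubs can be sketched in Python as below. It is a simplified illustration of Algorithm 1 with line 4 implemented by picking the unmatched stub of smallest index; the function name `stub_traversal`, the representation of a stub as an (index, owner) pair, and the example degrees are assumptions. In this sketch each returned sequence is a prefix of the nodes ordered by their smallest stub index, so the FIFO (BFS) and LIFO (DFS) variants agree for as long as both are still running; they can differ only in when the queue runs empty.

```python
import random
from collections import deque


def stub_traversal(indices, start, discipline="fifo"):
    """Algorithm 1 with on-the-fly matching via stub indices.

    indices[v] lists the real-valued indices of the stubs of node v.
    Returns the sequence S of discovered nodes.
    """
    # Every stub is an (index, owner) pair; initially all are unmatched.
    unmatched = {(t, v) for v, ts in enumerate(indices) for t in ts}
    S, seen = [start], {start}
    Q = deque((t, start) for t in indices[start])
    while Q:
        a = Q.popleft() if discipline == "fifo" else Q.pop()
        unmatched.discard(a)
        if not unmatched:
            break  # only possible if the total number of stubs is odd
        b = min(unmatched)  # unmatched stub with the smallest index
        unmatched.discard(b)
        owner = b[1]
        if owner not in seen:
            seen.add(owner)
            S.append(owner)
            Q.extend((t, owner) for t in indices[owner] if (t, owner) != b)
        else:
            Q.remove(b)  # b was waiting in Q and is now matched
    return S


rng = random.Random(0)
degrees = [3, 2, 2, 1, 4, 2]  # illustrative degrees with an even sum
indices = [[rng.random() for _ in range(k)] for k in degrees]
S_bfs = stub_traversal(indices, start=0, discipline="fifo")
S_dfs = stub_traversal(indices, start=0, discipline="lifo")
```

Taking the minimum over a set at every step is quadratic in the number of stubs; a heap would be the natural choice for larger graphs, but the set keeps the sketch short.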

6) Equivalence to weighted sampling without replacement:
Consider a node $v$ with a degree $k_v$. The probability that $v$ is
discovered before time $t$, given that it has not been discovered
before $t_0 < t$, is

$$
P(v \text{ before time } t \mid v \text{ not before } t_0) = 1 - \left(\frac{1-t}{1-t_0}\right)^{k_v}. \tag{10}
$$

We now take the derivative $\frac{\partial}{\partial t}$ of the above equation,
which results in the conditional probability density function
$\frac{k_v}{1-t_0}\left(\frac{1-t}{1-t_0}\right)^{k_v-1}$. Setting $t \to t_0$ (but keeping $t > t_0$) reduces it
to $\frac{k_v}{1-t_0}$, which is the density of probability that $v$ is sampled
at $t_0$, given that it has not been sampled before. This means
that at every point in time, out of all nodes that have not yet
been selected, the probability of selecting $v$ is proportional
to its degree $k_v$. Therefore, this scheme is equivalent to node
sampling weighted by degree, without replacements.

7) Equivalence to RW for $f \to 0$: Finally, for $f \to 0$ (and thus
$t \to 0$), we have $1-(1-t)^k \simeq kt$, and Eq. (9) simplifies to Eq. (2).
This means that in the beginning of the sampling process,
every traversal technique is equivalent to RW, as shown in
Fig. 1 for $f \to 0$.
8) $\langle k^* \rangle$ is decreasing in $f$: Let us denote by $X_i \in V$
the $i$th selected node. As we have shown above that our
procedure is equivalent to node degree weighted sampling
without replacements, we can write:

$$
P(X_1 = u) = \frac{k_u}{z},
$$

$$
P(X_2 = w) = \sum_{u \neq w} \frac{k_u}{z} \cdot \frac{k_w}{z - k_u} = \frac{k_w}{z} \, \alpha_w,
$$

where $z = 2|E|$ and $\alpha_w = \sum_{u \neq w} \frac{k_u}{z - k_u}$. Because for any two
nodes $a$ and $b$, we have $\alpha_b - \alpha_a = z(k_a - k_b)/((z - k_a)(z - k_b))$,
$\alpha_w$ strictly decreases with growing $k_w$. As a result, $P(X_2)$
is more concentrated around nodes with smaller degrees than
is $P(X_1)$, implying that $E[k_{X_2}] < E[k_{X_1}]$. We can use an
analogous argument at every iteration $i \leq |V|$, which allows
us to say that $E[k_{X_i}] < E[k_{X_{i-1}}]$. In other words, $\langle k^* \rangle(f)$ is
a decreasing function of $f$.

A practical consequence is that many short traversals (e.g., 
BFS-es) are more biased than a long one, with the same total 
number of samples. 
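The first step of this argument can be checked exactly on a small example. The Python sketch below computes the expected degree of the first and of the second sampled node under degree-weighted sampling without replacement, following the two probabilities written above. The function name `first_two_expected_degrees` and the degree sequence are assumptions chosen for illustration.

```python
from fractions import Fraction


def first_two_expected_degrees(degrees):
    """Exact E[k_X1] and E[k_X2] for degree-weighted sampling
    without replacement over the given degree sequence."""
    z = sum(degrees)
    e1 = sum(Fraction(k * k, z) for k in degrees)
    e2 = Fraction(0)
    for w, k_w in enumerate(degrees):
        # alpha_w = sum over u != w of k_u / (z - k_u)
        alpha_w = sum(
            Fraction(k_u, z - k_u)
            for u, k_u in enumerate(degrees)
            if u != w
        )
        # P(X2 = w) = (k_w / z) * alpha_w, weighted by the degree k_w.
        e2 += k_w * Fraction(k_w, z) * alpha_w
    return e1, e2


# Illustrative degree sequence (assumed): four nodes with degrees 1..4.
e1, e2 = first_two_expected_degrees([1, 2, 3, 4])
```

For this sequence the first draw has expected degree 3 and the second has expected degree 101/36, which is smaller, in line with the monotonicity argument.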

C. Comments on the starting node and graph connectivity 

In all exploration techniques, the choice of the starting
node $v_1$ can have a strong effect on the first iterations.
For example, if $v_1$ is a low-degree node then the degree
distribution $q_k$ sampled in the first iterations is naturally biased
toward lower degrees. In fixed graphs, this problem is usually
addressed by selecting $v_1$ as the last node of an appropriately
long "burn-in" run of RW (or MHRW when this technique is
used), started at an arbitrary node. In the case of a random
graph $RG(p_k)$, the problem is even simpler, because already
the second node of RW follows $\pi_v = \frac{k_v}{2|E|}$, which reduces the
burn-in period to one hop only.

Another issue is that the configuration model $RG(p_k)$ might
result in a graph $G$ that is not connected. In this case,
every exploration technique covers only the component $C$ in
which it was initiated; consequently, the process described in
Section V-B3 stops once $C$ is covered.

D. A convenient interpretation 

It might be sometimes convenient to split the exploration
techniques, in $RG(p_k)$, into three simple classes, with respect
to the node degree bias they experience. These classes can be
defined as ways to sample nodes from a pool of all nodes
$V$, independently of the actual topology of $G$. MHRW is
equivalent to uniform node sampling with replacement. RW
is equivalent to degree-weighted node sampling with
replacement. Finally, all traversal techniques are equivalent to
degree-weighted node sampling without replacement. The above
holds strictly for $RG(p_k)$ only, but it can be an insightful
interpretation, in general.

VI. Correcting for node degree bias 

In the previous section we derived the expected observed
degree distribution $q_k$ as a function of the original degree
distribution $p_k$, for three general graph exploration techniques.
The distribution $q_k$ is usually biased towards high-degree
nodes. In this section, we derive unbiased estimators $\hat{p}_k$ and
$\langle \hat{k} \rangle$ of the original degree distribution $p_k$ and its mean $\langle k \rangle$,
respectively.

Let $S \subset V$ be a sequence of vertices that we sampled.
Based on $S$, we can estimate $q_k$ as

$$
\hat{q}_k = \frac{\text{number of nodes in } S \text{ with degree } k}{|S|}. \tag{11}
$$



A. Random Walk (RW) 

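In order to estimate $p_k$ based on $\hat{q}_k$, consider again Eq. (2),
which says that $q_k$ is proportional to $k p_k$. Therefore, $p_k$ is
proportional to $q_k/k$, and $\hat{p}_k$ is proportional to $\hat{q}_k/k$, which
allows us to write (similarly to [23]):

$$
\hat{p}_k = \frac{\hat{q}_k}{k} \left( \sum_l \frac{\hat{q}_l}{l} \right)^{-1}, \tag{12}
$$

where $\sum_l \frac{\hat{q}_l}{l}$ is a normalizing constant. From Eq. (12), we can
estimate the average node degree as

$$
\langle \hat{k} \rangle = \sum_k k \, \hat{p}_k = \left( \sum_l \frac{\hat{q}_l}{l} \right)^{-1} = \frac{|S|}{\sum_{v \in S} \frac{1}{k_v}}. \tag{13}
$$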
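The random-walk correction amounts to reweighting each observed degree by its inverse and taking the harmonic mean of the sampled degrees. A minimal Python sketch follows, assuming the sample is given as a plain list of the degrees of the visited nodes; the function name `correct_rw_sample` and the example walk are hypothetical.

```python
from collections import Counter


def correct_rw_sample(sampled_degrees):
    """Estimate p_k and <k> from the degrees seen along a random walk.

    sampled_degrees lists k_v for every node v in the sample S
    (with repetitions); all degrees are assumed to be at least 1.
    """
    n = len(sampled_degrees)
    # Observed distribution: fraction of samples with each degree.
    q_hat = {k: c / n for k, c in Counter(sampled_degrees).items()}
    # Normalizing constant: sum over l of q_hat_l / l.
    norm = sum(qk / k for k, qk in q_hat.items())
    p_hat = {k: (qk / k) / norm for k, qk in q_hat.items()}
    # Harmonic mean of the sampled degrees.
    mean_hat = n / sum(1.0 / k for k in sampled_degrees)
    return p_hat, mean_hat


# Hypothetical walk: degree-3 nodes appear three times as often as degree-1 nodes.
p_hat, mean_hat = correct_rw_sample([3, 3, 3, 1])
```

In this example the raw sample mean is 2.5, while the corrected estimate is 2, with both degrees estimated to be equally frequent.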

B. Metropolis Hastings Random Walk (MHRW)

In this case, equations (4) and (5) trivially yield

$$
\hat{p}_k = \hat{q}_k, \quad \text{and} \tag{14}
$$

$$
\langle \hat{k} \rangle = \sum_k k \, \hat{p}_k = \sum_k k \, \hat{q}_k. \tag{15}
$$



C. Graph traversal 

From Eq. ^ we know that Pk{f) is proportional to qk /{I 
(1-i (/))'')■ Consequently, 

qk sr^ qi \ 



Pk 



1 -(!-*(/))'= 



^ 1 



a-tif)y 



(16) 



However, in order to evaluate this expression, we need to 
evaluate t{f), that, in turn, requires pk- We can solve this 
chicken-and-egg problem iteratively, if we know the real 
fraction y''<=°' of covered nodes, or equivalently the graph 
size \V\. First, we evaluate Eq.(fT6]l for some values of t and 
feed the resulting pk's into Eq. ^ to obtain the corresponding 
/'s. By repeating this process, we can drive the values of / 
arbitrarily close to and thus find the desired pk. 

In summary, for graph traversal techniques, Eq. (16) shows
how to estimate the original degree distribution $p_k$ given
that the real graph coverage $f^{real}$ is known, which is often the case
in practice. Of course, based on our estimator $\hat{p}_k$, we can
calculate the average node degree as $\widehat{\langle k \rangle} = \sum_k k\,\hat{p}_k$.
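The iterative procedure for graph traversal can be sketched as a bisection over t. This is a sketch under stated assumptions, not the authors' implementation: the coverage function is not reproduced in this section, so the code assumes it has the form f(t) = sum_k p_k (1 - (1 - t)^k), which is consistent with the proportionality used in Eq. (16). The name `traversal_estimate` and the dictionary input are illustrative, and all degrees are assumed to be at least 1.

```python
def traversal_estimate(q_hat, f_real, tol=1e-12):
    """Find t such that the coverage implied by the corrected p_hat equals f_real."""

    def corrected(t):
        # Eq. (16): reweight q_hat_k by 1 / (1 - (1 - t)^k) and normalize.
        w = {k: q / (1.0 - (1.0 - t) ** k) for k, q in q_hat.items()}
        z = sum(w.values())
        return {k: v / z for k, v in w.items()}

    def coverage(t, p):
        # ASSUMED form of the coverage function f(t); see the lead-in.
        return sum(pk * (1.0 - (1.0 - t) ** k) for k, pk in p.items())

    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(mid, corrected(mid)) < f_real:
            lo = mid   # implied coverage too small: t must grow
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    return t, corrected(t)
```

Under the assumed form of f(t), the implied coverage grows monotonically with t, which is why a plain bisection suffices here.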

VII. Simulation results 

In this section, we implement and simulate the considered
sampling techniques, namely BFS, DFS, FF (with p = 0.5),
RW and MHRW. The simulations confirm our analytical
results. More importantly, in simulations we can study the
effect of topological properties, such as assortativity, that
are not directly captured by the random graph model $RG(p_k)$.
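To make the simulated quantity concrete, the following sketch runs an incomplete BFS on a graph stored as an adjacency dictionary and computes the average degree of the sampled nodes. The representation and the function names `bfs_sample` and `mean_degree` are assumptions made for illustration; the simulations in the paper use configuration-model graphs rather than this toy star graph.

```python
from collections import deque

def bfs_sample(adj, start, n_samples):
    """Return the first n_samples nodes visited by BFS from start."""
    visited = {start}
    order = []
    queue = deque([start])
    while queue and len(order) < n_samples:
        v = queue.popleft()
        order.append(v)
        for w in adj[v]:
            if w not in visited:
                visited.add(w)
                queue.append(w)
    return order

def mean_degree(adj, nodes):
    """Average degree over the given nodes."""
    return sum(len(adj[v]) for v in nodes) / len(nodes)

# Toy star graph: hub 0 connected to leaves 1..5.
adj = {0: [1, 2, 3, 4, 5], 1: [0], 2: [0], 3: [0], 4: [0], 5: [0]}
sample = bfs_sample(adj, start=1, n_samples=2)
```

Even when started at a leaf, the two-node sample already contains the hub, so its average degree exceeds the true average of the graph.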
Fig. 3. Comparison of sampling techniques in theory and in simulation. Left: Observed (sampled) average node degree $\langle k^* \rangle$ as a function of the fraction
$f$ of sampled nodes, for various sampling techniques. The results are averaged over 1000 graphs with 10000 nodes each, generated by the configuration model
with a fixed heavy-tailed degree distribution $p_k$ (shown on the right). Right: Real, expected, and estimated (corrected) degree distributions for selected
techniques and values of $f$ (other techniques behave analogously). We obtained analogous results for other degree distributions and graph sizes $|V|$. The
term $\langle k \rangle$ is the real average node degree, and $\langle k^2 \rangle$ is the real average squared node degree.



[Plots for Fig. 4. Left panel: average node degree, assortativity r > 0. Right panel: average node degree, assortativity r < 0. Horizontal axis: f, fraction of covered nodes. Legend: expected (analytic), random walk (RW), Metropolis-Hastings random walk (MHRW), Breadth First Search (BFS), Depth First Search (DFS), Forest Fire (FF), weighted random sample.]



Fig. 4. The effect of assortativity $r$ on the results. First, we use the configuration model with the same degree distribution $p_k$ as in Fig. 3 (and the same
number of nodes $|V| = 10000$) to generate a graph $G$. Next, we apply the pairwise edge rewiring technique [29] to change the assortativity $r$ of $G$ without
changing node degrees. This technique iteratively takes two random edges $\{v_1, w_1\}$ and $\{v_2, w_2\}$, and rewires them as $\{v_1, w_2\}$ and $\{v_2, w_1\}$ only if it
brings us closer to the desired value of assortativity $r$. As a result, we obtain graphs with a positive (left) and negative (right) assortativity $r$. Note that for
better readability, we present only the values of $f \in [0, 0.1]$, i.e., ten times smaller than in Fig. 3.
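The pairwise rewiring described in the caption of Fig. 4 can be sketched as below. This is an illustrative sketch, not the code used for the figure: the function names, the edge-list representation, the fixed number of steps, and the choice to skip swaps that would create self-loops or duplicate edges are all assumptions. The assortativity is computed as the Pearson correlation of the degrees at the two ends of each edge, and it is recomputed from scratch at each step, which is simple but slow for large graphs.

```python
import random

def degrees(edges):
    """Degree of each node in an undirected edge list."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg

def assortativity(edges, deg):
    """Pearson correlation of end-point degrees, each edge counted in both directions."""
    xs, ys = [], []
    for u, v in edges:
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    m = sum(xs) / len(xs)
    var = sum((x - m) ** 2 for x in xs)
    cov = sum((x - m) * (y - m) for x, y in zip(xs, ys))
    return cov / var if var > 0 else 0.0

def rewire_towards(edges, target_r, steps=10000, seed=0):
    """Swap end-points of random edge pairs, keeping a swap only if r moves closer to target_r."""
    rng = random.Random(seed)
    edges = [tuple(e) for e in edges]
    deg = degrees(edges)                       # unchanged by any swap
    present = {frozenset(e) for e in edges}
    r = assortativity(edges, deg)
    for _ in range(steps):
        i, j = rng.sample(range(len(edges)), 2)
        (v1, w1), (v2, w2) = edges[i], edges[j]
        new1, new2 = (v1, w2), (v2, w1)
        if v1 == w2 or v2 == w1:
            continue                           # would create a self-loop
        if frozenset(new1) in present or frozenset(new2) in present:
            continue                           # would create a duplicate edge
        trial = list(edges)
        trial[i], trial[j] = new1, new2
        r_new = assortativity(trial, deg)
        if abs(r_new - target_r) < abs(r - target_r):
            present -= {frozenset(edges[i]), frozenset(edges[j])}
            present |= {frozenset(new1), frozenset(new2)}
            edges, r = trial, r_new
    return edges, r
```

Because every swap replaces {v1, w1} and {v2, w2} by {v1, w2} and {v2, w1}, each node keeps exactly the same number of incident edges, so the degree distribution is preserved by construction.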



A. Estimating Degree Distributions and Average Degree 

Fig. 3 verifies all the formulae derived in this paper, for
a random graph with a given power-law distribution. The
analytical expectations are plotted in thick plain lines in the
background and the averaged simulation results are plotted
in thinner lines lying on top of them. We observe an almost
perfect match between theory and simulation in estimating
the sampled degree distribution $q_k$ (Fig. 3, right) and its mean
$\langle k^* \rangle$ (Fig. 3, left). Indeed, all traversal techniques follow the
same curve (as predicted in Section V-B5), which initially coincides
with that of RW (see Section V-B7) and is monotonically decreasing
in $f$ (see Section V-B8). We also show that degree-weighted node
sampling without replacement exhibits exactly the same bias
(see Section V-B6). Finally, applying the estimators $\hat{p}_k$ derived in
Section VI corrects for the bias of $q_k$.

B. The effect of degree-degree correlations (assortativity r) 

Depending on the type of network, nodes may tend to
connect to similar or different nodes. For example, in most
social networks high-degree nodes tend to connect to other
high-degree nodes [30]. Such networks are called assortative.
In contrast, biological and technological networks are
typically disassortative, i.e., they exhibit significantly more
high-degree-to-low-degree connections. This observation can
be quantified by calculating the assortativity coefficient $r$ [30],
which is the correlation coefficient computed over all edges
(i.e., degree-degree pairs) in the graph. Values $r < 0$, $r > 0$ and
$r = 0$ indicate disassortative, assortative and purely random
graphs, respectively.

For the same initial parameters as in Fig. 3 ($p_k$, $|V|$),
we simulated different levels of assortativity. Fig. 4 shows
the results. Graph assortativity $r$ strongly affects the first
iterations of traversal techniques. Indeed, for assortativity
$r > 0$ (Fig. 4, left), the degree bias is even stronger than
for $r = 0$ (Fig. 3, left). This is because the high-degree
nodes are now interconnected more densely than in a purely
random graph, and are thus easier to discover by sampling
techniques that are inherently biased towards high-degree
nodes. Interestingly, Forest Fire is by far the most affected.
A possible explanation is that under Forest Fire, low-degree
nodes are likely to be completely skipped by the first sampling
wave. Not surprisingly, a negative assortativity $r < 0$ has
the opposite effect: every high-degree node tends to connect
to low-degree nodes, which significantly slows down the
discovery of the former.

TABLE II
Facebook measurements - data set overview. $|S|$ and $f$ are the
absolute and relative lengths of the collected samples. For
more details refer to [5].

         UNI      RW       BFS28               BFS1     MHRW
$|S|$    982K     2.26M    28 × 81K = 2.26M    1.19M    2.26M
$f$      0.44%    1.03%    28 × 0.04%          0.54%    1.03%

In contrast, random walks RW and MHRW are not affected 
by the changes in assortativity. This is expected, because 
their stationary distributions hold for any fixed (connected and 
aperiodic) graph regardless of its topological properties. 

C. Other graph properties 

We also attempted to simulate the effect of other basic 
graph properties, such as clustering or modularity. However, 
all these properties are interdependent, which makes it difficult 
to interpret the results. For example, [31] recently described
an extension of the configuration model to generate random
graphs with a given level of clustering c. However, the assorta- 
tivity r turns out to strongly depend on c. Rather than showing 
preliminary results, we decided to defer them to future work, 
where we are planning to incorporate some of these additional 
topological properties in our analytical model. 

VIII. Real life example: Sampling of Facebook

In this section we apply and test the previous ideas in
a real-life large-scale system - the Facebook social graph.
With 250+ million active users, Facebook is currently the
largest online social network. Crawling the entire topology of
Facebook would require downloading about 50TB of HTML
data, which makes sampling a very practical alternative.

A. Data collection 

We have implemented a set of crawlers to collect the
samples of Facebook (FB) according to the UNI, BFS, RW,
MHRW techniques. The details of our implementation are
described in [5]. The collected data sets are summarized in
Table II.

UNI refers to a uniform sample of FB users. It was
obtained by uniformly sampling the entire FB userID space
and discarding non-allocated userIDs. This is a trivial version
of rejection sampling and guarantees a uniform sampling of
the existing users, regardless of their actual distribution in
the userID space. UNI gives a high quality estimation of $p_k$
and $\langle k \rangle$, mainly thanks to a large number of samples $|S|$.
Therefore, we use UNI as ground truth for comparison of
various techniques.
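The rejection sampling behind UNI can be sketched as follows. The function name, the `is_allocated` callback (which in a real crawl would be some lookup of whether a userID exists) and the size of the ID space are hypothetical placeholders; the actual crawler is described in [5].

```python
import random

def uniform_user_sample(id_space_size, is_allocated, n_samples, seed=None):
    """Draw userIDs uniformly from the whole ID space and keep only allocated ones."""
    rng = random.Random(seed)
    sample = []
    while len(sample) < n_samples:
        candidate = rng.randrange(id_space_size)   # uniform over all possible IDs
        if is_allocated(candidate):                # reject non-allocated IDs
            sample.append(candidate)
    return sample

# Toy ID space in which only every 7th ID is allocated.
allocated = set(range(0, 1000, 7))
sample = uniform_user_sample(1000, allocated.__contains__, 50, seed=1)
```

Every allocated ID is accepted with the same probability per draw, so the accepted IDs are uniform over existing users no matter how those users are spread over the ID space. The loop does not terminate if no ID is allocated.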

We ran two types of BFS crawling. BFS28 consists of 28
small BFS-es initiated at 28 randomly chosen nodes from UNI,
which allowed us to easily parallelize the process. Moreover,
at the time of data collection, we (naively) thought that this
would reduce the BFS bias. After gaining more insight into
the process (which, nota bene, motivated this paper), we
collected a single large BFS1, initiated at a randomly chosen
node from UNI. The implementation of RW and MHRW is
straightforward.

TABLE III
Facebook measurements - average node degree. Average degree: sampled
(row 1), expected (row 2) and corrected (row 3) for various
techniques. For each expected and corrected value, we give in
parentheses the formula used to compute it.

                                        UNI     RW           BFS28        BFS1         MHRW
$\langle k^* \rangle$ sampled           94.1    338          323.9        283.9        95
$\langle k^* \rangle$ expected          -       329.8        329.1        328.7 (9)    94.1 (5)
$\widehat{\langle k \rangle}$ estimated -       93.9 (13)    85.4 (16)    72.7 (16)    95.2 (15)

B. Results 

We present the Facebook sampling results in Table III and
in Fig. 5. The first row of Table III shows the average node
degree $\langle k^* \rangle$ observed (sampled) by several techniques. The
value sampled by UNI is $\langle k^* \rangle = 94.1$, which we interpret as
the real value $\langle k \rangle$. MHRW, as expected, recovers a similar
value. In contrast, RW and BFS are both biased towards high
degrees by a factor larger than three! The degree bias of RW
is the largest. It drops very slightly under the (relatively very
short) BFS28 crawl, which confirms our findings from Section V-B7.
BFS1, a sample 15 times longer than BFS28, is significantly
less biased, which is in agreement with Section V-B8.

The second row shows the expected sampled average node
degrees (i.e., our predictions of the values in the first row),
assuming that the underlying Facebook topology is a random
graph $RG(p_k)$ with degree distribution $p_k$ equal to
that sampled by UNI. As expected, this works very well
for RW. However, the values predicted for BFS significantly
overshoot the reality. This is because Facebook is not a random
graph $RG(p_k)$. For example, Facebook, as most social
networks [26], is characterized by a high clustering coefficient $c$.
We believe that it is possible to incorporate this fact in our
analytical model, e.g., by appropriately stretching the function
$f(t)$ in Eq. (8). This is a main goal of our future work.
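For RW, the prediction in the second row follows from the proportionality of $q_k$ to $k\,p_k$ recalled in Section VI-A. As a short worked step, assuming the normalized form $q_k = k\,p_k / \langle k \rangle$ (the normalization is implied by $\sum_k q_k = 1$), the expected sampled average degree is:

```latex
\langle k^* \rangle_{RW}
  = \sum_k k \, q_k
  = \sum_k k \, \frac{k \, p_k}{\langle k \rangle}
  = \frac{\langle k^2 \rangle}{\langle k \rangle}
```

With $p_k$ taken from the UNI sample, this ratio depends only on the first two moments of the degree distribution and not on any other property of the topology.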

Finally, in the last row of Table III we apply the estimators
developed in Section VI to correct the degree biases of RW
and BFS. In the case of RW, the correction works very
well. Unfortunately, for the BFS estimator the results are
significantly worse, clearly for the reasons discussed in the
previous paragraph.

All the above observations hold not only for the average
node degree, but also for the entire degree distribution, which
is shown in Fig. 5.

C. Practical recommendations 

BFS is strongly biased toward high-degree nodes. It is
possible to correct for this bias precisely when the underlying
graph is an $RG(p_k)$ (which is not the case in practice). Also,
in more realistic graphs, this bias can be corrected reasonably
well for a very small sample size (as is the case for BFS28),
where BFS is similar to RW. On the other
extreme, for very large sampling coverage, the bias of BFS
becomes relatively small and could sometimes be neglected
(even without additional correction). However, in all other
cases, the results become difficult to interpret. In contrast, both
RW (equipped with a correction procedure) and MHRW are
unbiased, regardless of the actual graph topology. Therefore,
we recommend using RW and MHRW (with a slight advantage
of RW [3]) as general methods to sample the node properties.

[Plot for Fig. 5. Title: Degree distributions sampled in Facebook. Horizontal axis: k, node degree.]

Fig. 5. Facebook measurements - degree distribution. Crawlers used:
UNI, RW and BFS. All plots are in log-log scale with logarithmic binning
of data (we take the average of all points that fall in the same bin). We also
correct these distributions, as described in Section VI.
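The logarithmic binning mentioned in the caption of Fig. 5 can be sketched as follows. The bin edges at powers of two, the use of the mean degree as the bin position, and the function name `log_bin` are assumptions made for this illustration; the caption only states that points falling in the same bin are averaged.

```python
import math
from collections import defaultdict

def log_bin(points):
    """Average (k, value) points within bins [1,2), [2,4), [4,8), ... (k >= 1)."""
    bins = defaultdict(list)
    for k, y in points:
        bins[int(math.floor(math.log2(k)))].append((k, y))
    binned = []
    for b in sorted(bins):
        ks, ys = zip(*bins[b])
        binned.append((sum(ks) / len(ks), sum(ys) / len(ys)))
    return binned

# Hypothetical (degree, frequency) points of a sampled degree distribution.
binned = log_bin([(1, 0.4), (2, 0.2), (3, 0.1), (4, 0.06), (5, 0.04), (7, 0.02)])
```

On a log-log plot the bins have equal width, which smooths the noisy tail of a heavy-tailed distribution.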

In contrast, RW and MHRW are not really useful when sam- 
pling non-local graph properties, such as the graph diameter or 
the average shortest path length. In this case, BFS seems very 
attractive, because it produces a full view of a particular region 
in the graph, which is usually a densely connected graph itself, 
and for which the non-local properties can be easily calculated. 
However, all such results should be interpreted very carefully,
as they may also be strongly affected by the bias of BFS. For
example, the graph diameter (usually) drops significantly with 
growing average node degree of a network. 

IX. Conclusion and Future Directions 

In this paper, we analyzed the bias in estimating node degree 
when BFS (and other graph traversal techniques that sample 
nodes without replacement) are used to crawl a large, static, 
undirected network that is modeled by a random graph with a 
given, arbitrary degree distribution. We also compared BFS 
and graph traversal techniques to the well-studied random 
walks, and we were able to explain many of the similarities 
and differences that were only empirically observed so far. To 
the best of our knowledge, this is a first step towards analyzing 
the bias of BFS sampling, which is widely used in practice. In 
future work, we plan to extend our theoretical framework and 
study the effect of topological properties other than the degree 
distribution (such as assortativity, clustering, or community 
structure) on the bias of BFS and other techniques. 

References 

[1] M. R. Henzinger, A. Heydon, M. Mitzenmacher, and M. Najork, "On near-uniform url sampling," in Proc. of WWW, 2000.
[2] D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger, "On unbiased sampling for unstructured peer-to-peer networks," in Proc. of IMC, 2006.
[3] A. Rasti, M. Torkjazi, R. Rejaie, N. Duffield, W. Willinger, and D. Stutzbach, "Respondent-driven sampling for characterizing unstructured overlays," in INFOCOM Mini-Conference, April 2009.
[4] C. Gkantsidis, M. Mihail, and A. Saberi, "Random walks in peer-to-peer networks," in Proc. of Infocom, 2004.
[5] M. Gjoka, M. Kurant, C. T. Butts, and A. Markopoulou, "A walk in facebook: Uniform sampling of users in online social networks," http://arxiv.org/abs/0906.0060, 2009.
[6] B. Krishnamurthy, P. Gill, and M. Arlitt, "A few chirps about twitter," in Proc. of WOSN, 2008.
[7] J. Leskovec and C. Faloutsos, "Sampling from large graphs," in Proc. of ACM SIGKDD, 2006.
[8] L. Lovasz, "Random walks on graphs, a survey," in Combinatorics, 1993.
[9] M. Najork and J. L. Wiener, "Breadth-first search crawling yields high-quality pages," in Proc. of WWW, 2001.
[10] Y. Ahn, S. Han, H. Kwak, S. Moon, and H. Jeong, "Analysis of Topological Characteristics of Huge Online Social Networking Services," in Proc. of WWW, 2007.
[11] A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel, and S. Bhattacharjee, "Measurement and Analysis of Online Social Networks," in Proc. of IMC, 2007.
[12] C. Wilson, B. Boe, A. Sala, K. P. Puttaswamy, and B. Y. Zhao, "User interactions in social networks and their implications," in Proc. of EuroSys, 2009.
[13] S. H. Lee, P.-J. Kim, and H. Jeong, "Statistical properties of sampled networks," Phys. Rev. E, vol. 73, p. 016102, 2006.
[14] L. Becchetti, C. Castillo, D. Donato, and A. Fazzone, "A comparison of sampling techniques for web graph characterization," in LinkKDD, 2006.
[15] A. Mislove, H. S. Koppula, K. P. Gummadi, P. Druschel, and B. Bhattacharjee, "Growth of the flickr social network," in Proc. of WOSN, 2008.
[16] J. H. Kim, "Poisson cloning model for random graphs," International Congress of Mathematicians (ICM), 2006 (preprint in 2004).
[17] D. Achlioptas, A. Clauset, D. Kempe, and C. Moore, "On the bias of traceroute sampling: or, power-law degree distributions in regular graphs," in STOC, 2005.
[18] M. Q. Shahbaz, "Sampling with unequal probabilities and without replacement," Ph.D. dissertation.
[19] J. Illenberger, G. Flotterod, and K. Nagel, "An approach to correct bias induced by snowball sampling," Sunbelt Social Networks Conference, 2009.
[20] L. Goodman, "Snowball sampling," Annals of Mathematical Statistics, vol. 32, pp. 148-170, 1961.
[21] W. Gilks, S. Richardson, and D. Spiegelhalter, Markov Chain Monte Carlo in Practice. Chapman and Hall/CRC, 1996.
[22] D. Heckathorn, "Respondent-driven sampling: A new approach to the study of hidden populations," Social Problems, vol. 44, pp. 174-199, 1997.
[23] M. Salganik and D. Heckathorn, "Sampling and estimation in hidden populations using respondent-driven sampling," Sociological Methodology, vol. 34, pp. 193-239, 2004.
[24] J. Leskovec, J. Kleinberg, and C. Faloutsos, "Graphs over time: densification laws, shrinking diameters and possible explanations," in KDD, 2005.
[25] M. Molloy and B. Reed, "A critical point for random graphs with a given degree sequence," pp. 161-179, 1995.
[26] M. E. J. Newman, "The structure and function of complex networks," SIAM Review, vol. 45, pp. 167-256, 2003.
[27] M. E. J. Newman, "Ego-centered networks and the ripple effect," Social Networks, vol. 25, pp. 83-95, 2003.
[28] R. Motwani and P. Raghavan, Randomized Algorithms. Cambridge University Press, 1990.
[29] S. Maslov and K. Sneppen, "Specificity and stability in topology of protein networks," Science, vol. 296, no. 5569, pp. 910-913, May 2002.
[30] M. Newman, "Assortative mixing in networks," in Phys. Rev. Lett. 89, 2002.
[31] M. E. J. Newman, "Random graphs with clustering," Phys. Rev. Lett. (in press), 2009.



